class: center, middle, inverse, title-slide

# Introduction to R for Data Analysis
## Data Wrangling Advanced
### Johannes Breuer & Stefan Jünger
### 2021-08-03

---
layout: true

---

## Data wrangling continued 🤠

While the last sessions focused on the bread-and-butter tasks of data preparation, this part focuses on the more 'programmy' side of things:

- altering the content of a whole set of variables
- conditional variable transformation
- formulating logical requests to our data
- writing loops

---
class: middle

**We will largely remain in the world of the `tidyverse` since it makes the steps of wrangling data so transparent and straightforward. However, we will also show that you can easily combine **

---

## Load the data

Again, we will work with the *Public Use File (PUF) of the GESIS Panel Special Survey on the Coronavirus SARS-CoV-2 Outbreak in Germany* as a `.csv` file.

```r
library(tidyverse) # provides read_csv2(), the pipe, and the dplyr verbs used below

gp_covid <- read_csv2("./data/ZA5667_v1-1-0.csv")
```

---

## Quickly define missing values

```r
library(sjlabelled)

gp_covid <- 
  gp_covid %>% 
  set_na(na = c(-99, -77, -33, 98))
```

---

## Variables of interest

Say we are interested in the (dis)trust towards several authorities during the Corona crisis. There are 9 items on this topic. Let's create some quick frequency tables for 3 of them.

```r
table(gp_covid$hzcy044a)
```

```
## 
##    1    2    3    4    5 
##   43  174  329 1250 1269
```

```r
table(gp_covid$hzcy047a)
```

```
## 
##    1    2    3    4    5 
##   29   61  188 1054 1763
```

```r
table(gp_covid$hzcy052a)
```

```
## 
##    1    2    3    4    5 
##   25   79  303 1422 1278
```

What if we want to apply a data reduction method (e.g., PCA) and need the variables in reverse order for interpretation purposes?

---

## Recode data **across** defined variables

The `dplyr` package provides a (new) handy tool to do exactly this: `across()`. This function can be used to apply another function to multiple variables at once.
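To see the mechanics of `across()` in isolation, here is a minimal sketch on a made-up tibble. The object `toy` and its columns `item_a` and `item_b` are hypothetical and not part of the GESIS Panel data. It reverses two 5-point items with simple arithmetic by applying one function to both columns.

```r
library(dplyr)

# hypothetical toy data, not part of the GESIS Panel file
toy <- tibble(
  id     = 1:3,
  item_a = c(1, 3, 5),
  item_b = c(2, 4, 5)
)

# apply the same function to item_a and item_b;
# on a 1-5 scale, 6 - x reverses the coding
toy_rev <- 
  toy %>% 
  mutate(
    across(
      item_a:item_b,
      ~ 6 - .x
    )
  )

toy_rev
```

The `id` column is left untouched because it is not part of the selection passed to `across()`.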
```r
gp_covid <-
  gp_covid %>% 
  mutate(
    across(
      hzcy044a:hzcy052a,
      ~recode(
        .x,
        `5` = 1, # `old value` = new value
        `4` = 2,
        `2` = 4,
        `1` = 5
      )
    )
  )
```

---
class: middle

```r
table(gp_covid$hzcy044a)
```

```
## 
##    1    2    3    4    5 
## 1269 1250  329  174   43
```

```r
table(gp_covid$hzcy047a)
```

```
## 
##    1    2    3    4    5 
## 1763 1054  188   61   29
```

```r
table(gp_covid$hzcy052a)
```

```
## 
##    1    2    3    4    5 
## 1278 1422  303   79   25
```

---

## Using `across()` with logical conditions

Sometimes we are interested in variables that meet certain conditions. For example, for an analysis, we may want to z-standardize all numeric variables in a dataset.

Let's create a temporary subset of our data to exemplify such efforts.

```r
gp_covid_tmp <- 
  gp_covid %>% 
  select(doi, hzcy044a:hzcy052a)

gp_covid_tmp %>% 
  sample_n(5) # randomly sample 5 cases from the df
```

```
## # A tibble: 5 x 10
##   doi             hzcy044a hzcy045a hzcy046a hzcy047a hzcy048a hzcy049a hzcy050a hzcy051a hzcy052a
##   <chr>              <dbl>    <dbl>    <dbl>    <dbl>    <dbl>    <dbl>    <dbl>    <dbl>    <dbl>
## 1 10.4232/1.13520        2        3        3        1        3        3        3        2        3
## 2 10.4232/1.13520        2        2        2        1        2        3        2        2        2
## 3 10.4232/1.13520        1        3        3        3        3       NA        3        3        3
## 4 10.4232/1.13520        3        2        2        2        3        2        3        2        2
## 5 10.4232/1.13520        2        1        2        1        1        2        2        2        1
```

---

## z-standardize all numeric variables

The `base R` function to z-standardize a variable is `scale()`.
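As a quick illustration of what `scale()` returns, here is a sketch on a made-up vector `x` (hypothetical, not from the data). Note that `scale()` returns a one-column matrix rather than a plain vector, which is why tibble column headers can show a `[,1]` suffix after applying it. Wrapping the result in `as.numeric()` is one way to get a plain vector back.

```r
# hypothetical example vector with one missing value
x <- c(2, 4, 6, NA)

z <- scale(x) # subtracts the mean and divides by the standard deviation
class(z)      # a matrix (with one column), not a plain vector

z_vec <- as.numeric(z) # drop the matrix structure
z_vec
```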
```r
gp_covid_tmp <-
  gp_covid_tmp %>% 
  mutate(
    across(
      where(is.numeric), # predicate functions need to be wrapped in where()
      ~scale(.x)
    )
  )

gp_covid_tmp %>% 
  sample_n(5)
```

```
## # A tibble: 5 x 10
##   doi             hzcy044a[,1] hzcy045a[,1] hzcy046a[,1] hzcy047a[,1] hzcy048a[,1] hzcy049a[,1] hzcy050a[,1] hzcy051a[,1] hzcy052a[,1]
##   <chr>                  <dbl>        <dbl>        <dbl>        <dbl>        <dbl>        <dbl>        <dbl>        <dbl>        <dbl>
## 1 10.4232/1.13520       -0.920       -0.214       -0.582       -0.722       -0.332       -0.372        -1.18        -1.09       -0.962
## 2 10.4232/1.13520       -0.920       -0.214       -0.582       -0.722       -0.332        -1.24       -0.179      -0.0341        0.302
## 3 10.4232/1.13520        0.164       -0.214       -0.582        0.570       -0.332       -0.372       -0.179      -0.0341        0.302
## 4 10.4232/1.13520       -0.920        -1.33        -1.63       -0.722        -1.32        -1.24        -1.18        -1.09       -0.962
## 5 10.4232/1.13520           NA           NA           NA           NA           NA           NA           NA           NA           NA
```

---

## `dplyr::across()`

<img src="data:image/png;base64,#C:\Users\mueller2\talks_presentations\r-intro-gesis-2021\content\img\across_blank.png" width="95%" style="display: block; margin: auto;" />
<small><small>Artwork by [Allison Horst](https://github.com/allisonhorst/stats-illustrations)</small></small>

---

## Aggregate variables across rows

Something we might want to do for our analyses is to create aggregate variables, such as sum or mean scores for a set of items. Because `dplyr` operations are applied to columns, whereas such aggregations relate to rows (i.e., respondents), we need to make use of the function `rowwise()`. Say, for example, we want to compute a sum score for all measures that respondents have reported to engage in to prevent an infection with or the spread of the Corona virus.

```r
gp_covid <- 
  gp_covid %>% 
  rowwise() %>% #<<
  mutate(
    sum_measures = 
      sum(
        c_across(hzcy044a:hzcy052a),
        na.rm = TRUE
      )
  ) %>% 
  ungroup()
```

---

## Aggregate variables

```r
gp_covid <- 
  gp_covid %>% 
  rowwise() %>% #<<
  mutate(
    sum_measures = 
      sum(
        c_across(hzcy044a:hzcy052a),
        na.rm = TRUE
      )
  ) %>% 
  ungroup()
```

Three things to note here:

1. `c_across()` is a special version of `across()` for rowwise operations.
2. We use the `ungroup()` function at the end to ensure that `dplyr` verbs will operate the default way when we further work with the `gp_covid` object. We do not cover grouping in this course (which is especially valuable for summarizing data), but you can check out the [documentation for `group_by()`](https://dplyr.tidyverse.org/reference/group_by.html) to learn more about this.
3. If you only need sums or means, a somewhat faster alternative is using the base `R` functions `rowSums()` and `rowMeans()` in combination with `mutate()` (and possibly also `across()` plus selection helpers). For an explanation why this can be faster, you can read the [online documentation for `rowwise()`](https://dplyr.tidyverse.org/articles/rowwise.html).

---

## Aggregate variables

```r
gp_covid %>% 
  select(hzcy044a:hzcy052a) %>% 
  glimpse()
```

```
## Rows: 3,765
## Columns: 9
## $ hzcy044a <dbl> NA, 1, 2, 2, NA, 2, 2, 2, NA, 1, 3, 1, NA, 2, 3, NA, NA, 1, NA, 2, NA, 1, 1, NA, 2, 1, 2, 1, 2, NA, NA, 1, NA, 2, 2, 2, 1, 3,~
## $ hzcy045a <dbl> NA, 2, 2, 2, NA, 1, 2, 2, NA, 2, 2, 3, NA, 3, 3, NA, NA, 1, NA, 4, NA, 2, 1, NA, 2, NA, NA, 3, 2, NA, NA, 1, NA, 2, NA, 2, NA~
## $ hzcy046a <dbl> NA, 2, 1, 2, NA, 2, 2, 2, NA, 3, 4, 4, NA, 3, 3, NA, NA, 1, NA, 4, NA, 3, 2, NA, 2, 2, 3, 3, 2, NA, NA, 3, NA, 2, 4, 3, 3, 3,~
## $ hzcy047a <dbl> NA, 1, 1, 2, NA, 1, 2, 1, NA, 2, 2, 1, NA, 2, 2, NA, NA, 2, NA, 1, NA, 1, 1, 2, 2, 2, 2, 2, 1, NA, NA, 3, NA, 2, 2, 1, 1, 3, ~
## $ hzcy048a <dbl> NA, 2, 2, 2, NA, 2, 3, 2, NA, 3, 4, 5, NA, 2, 4, NA, NA, 2, NA, 1, NA, 3, 2, 2, 2, 2, 2, 2, 2, NA, NA, 4, NA, 3, 3, 2, 1, 3, ~
## $ hzcy049a <dbl> NA, 2, 3, 2, NA, 4, 3, 2, NA, 4, 5, 4, NA, 2, 4, NA, NA, 2, NA, NA, NA, 3, 2, 2, 2, 2, 2, 2, 3, NA, NA, 4, NA, 5, 3, 2, 1, 3,~
## $ hzcy050a <dbl> NA, 2, 2, 2, NA, 2, 3, 2, NA, 4, 4, 5, NA, 2, 2, NA, NA, 1, NA, 2, NA, 3, 2, 2, 2, 2, 2, 2, 2, NA, NA, 2, NA, 2, 3, 1, 2, 3, ~
## $ hzcy051a <dbl> NA, 2, 4, 2, NA, 3, 4, 1, NA, 1, 2, 3, NA, 2, 1, NA, NA, 1, NA, 3, NA, 4, 1, 2, 2, 2, 2, 3, 2, NA, NA, 3, NA, 2, 3, 1, 1, 3, ~
## $ hzcy052a <dbl> NA, 2, 1, 2, NA, 1, 2, 1, NA, 1, 2, 2, NA, 1, 1, NA, NA, 1, NA, 1, NA, 2, 2, 2, 2, 2, 3, 3, 1, NA, NA, 4, NA, 2, 2, 2, 2, 2, ~
```

---

## Aggregate variables

Rowwise transformations work the same way for means. Here, we create a mean score for the items that ask how much people trust specific people or institutions in dealing with the Corona virus.

```r
gp_covid <- 
  gp_covid %>% 
  rowwise() %>%
  mutate(
    mean_trust = 
      mean(
        c_across(hzcy044a:hzcy052a),
        na.rm = TRUE
      )
  ) %>% 
  ungroup()
```

---

## Aggregate variables

```r
gp_covid %>% 
  select(hzcy044a:hzcy052a, mean_trust) %>% 
  glimpse()
```

```
## Rows: 3,765
## Columns: 10
## $ hzcy044a   <dbl> NA, 1, 2, 2, NA, 2, 2, 2, NA, 1, 3, 1, NA, 2, 3, NA, NA, 1, NA, 2, NA, 1, 1, NA, 2, 1, 2, 1, 2, NA, NA, 1, NA, 2, 2, 2, 1, ~
## $ hzcy045a   <dbl> NA, 2, 2, 2, NA, 1, 2, 2, NA, 2, 2, 3, NA, 3, 3, NA, NA, 1, NA, 4, NA, 2, 1, NA, 2, NA, NA, 3, 2, NA, NA, 1, NA, 2, NA, 2, ~
## $ hzcy046a   <dbl> NA, 2, 1, 2, NA, 2, 2, 2, NA, 3, 4, 4, NA, 3, 3, NA, NA, 1, NA, 4, NA, 3, 2, NA, 2, 2, 3, 3, 2, NA, NA, 3, NA, 2, 4, 3, 3, ~
## $ hzcy047a   <dbl> NA, 1, 1, 2, NA, 1, 2, 1, NA, 2, 2, 1, NA, 2, 2, NA, NA, 2, NA, 1, NA, 1, 1, 2, 2, 2, 2, 2, 1, NA, NA, 3, NA, 2, 2, 1, 1, 3~
## $ hzcy048a   <dbl> NA, 2, 2, 2, NA, 2, 3, 2, NA, 3, 4, 5, NA, 2, 4, NA, NA, 2, NA, 1, NA, 3, 2, 2, 2, 2, 2, 2, 2, NA, NA, 4, NA, 3, 3, 2, 1, 3~
## $ hzcy049a   <dbl> NA, 2, 3, 2, NA, 4, 3, 2, NA, 4, 5, 4, NA, 2, 4, NA, NA, 2, NA, NA, NA, 3, 2, 2, 2, 2, 2, 2, 3, NA, NA, 4, NA, 5, 3, 2, 1, ~
## $ hzcy050a   <dbl> NA, 2, 2, 2, NA, 2, 3, 2, NA, 4, 4, 5, NA, 2, 2, NA, NA, 1, NA, 2, NA, 3, 2, 2, 2, 2, 2, 2, 2, NA, NA, 2, NA, 2, 3, 1, 2, 3~
## $ hzcy051a   <dbl> NA, 2, 4, 2, NA, 3, 4, 1, NA, 1, 2, 3, NA, 2, 1, NA, NA, 1, NA, 3, NA, 4, 1, 2, 2, 2, 2, 3, 2, NA, NA, 3, NA, 2, 3, 1, 1, 3~
## $ hzcy052a   <dbl> NA, 2, 1, 2, NA, 1, 2, 1, NA, 1, 2, 2, NA, 1, 1, NA, NA, 1, NA, 1, NA, 2, 2, 2, 2, 2, 3, 3, 1, NA, NA, 4, NA, 2, 2, 2, 2, 2~
## $ mean_trust <dbl> NaN, 1.777778, 2.000000, 2.000000, NaN, 2.000000, 2.555556, 1.666667, NaN, 2.333333, 3.111111, 3.111111, NaN, 2.111111, 2.5~
```

---

## Conditional transformation

Sometimes, things are a bit more complicated. Simple recoding is insufficient when new variables need to be based on the values of one or more existing variables.

---

## Simple conditional transformation

The simplest version of a conditional variable transformation is using an `ifelse()` statement.

```r
gp_covid <- 
  gp_covid %>% 
  mutate(
    high_education = 
      ifelse(education_cat == 3, "high", "not so high")
  )

gp_covid %>% 
  select(education_cat, high_education) %>% 
  sample_n(5)
```

```
## # A tibble: 5 x 2
##   education_cat high_education
##           <dbl> <chr>
## 1             3 high
## 2             3 high
## 3             3 high
## 4             3 high
## 5             3 high
```

.small[
*Note*: A more versatile option for creating dummy variables is the [`fastDummies` package](https://jacobkap.github.io/fastDummies/).
]

---

## Advanced conditional transformation

For more flexible (or complex) conditional transformations, the `case_when()` function from `dplyr` is a powerful tool.
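The basic pattern is `condition ~ value`, with conditions checked from top to bottom. Here is a minimal sketch on a made-up vector; `score` and the category labels are hypothetical and only serve to show the syntax.

```r
library(dplyr)

# hypothetical scores on a 0-10 scale, including a missing value
score <- c(1, 5, 9, NA)

score_cat <- 
  case_when(
    score <= 3 ~ "low",    # checked first
    score <= 7 ~ "medium", # only used if the first condition is not TRUE
    score > 7  ~ "high"
  )

score_cat
```

The missing value matches none of the three conditions, so it stays missing in `score_cat`.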
```r
gp_covid <- 
  gp_covid %>% 
  mutate(
    pol_leaning_cat = 
      case_when(
        between(political_orientation, 0, 3) ~ "left",
        between(political_orientation, 4, 7) ~ "center",
        political_orientation > 7 ~ "right"
      )
  )

gp_covid %>% 
  select(political_orientation, pol_leaning_cat) %>% 
  sample_n(5)
```

```
## # A tibble: 5 x 2
##   political_orientation pol_leaning_cat
##                   <dbl> <chr>
## 1                     2 left
## 2                     6 center
## 3                     5 center
## 4                     7 center
## 5                     5 center
```

---

## Conditional transformation based on multiple values

```r
gp_covid <- 
  gp_covid %>% 
  mutate(
    pol_leaning_edu = 
      case_when(
        between(political_orientation, 0, 3) & high_education == "high" ~ "left high",
        between(political_orientation, 4, 7) & high_education == "high" ~ "center high",
        political_orientation > 7 & high_education == "high" ~ "right high",
        TRUE ~ "not so high"
      )
  )

gp_covid %>% 
  select(political_orientation, high_education, pol_leaning_edu) %>% 
  sample_n(5)
```

```
## # A tibble: 5 x 3
##   political_orientation high_education pol_leaning_edu
##                   <dbl> <chr>          <chr>
## 1                     5 not so high    not so high
## 2                     4 high           center high
## 3                     5 not so high    not so high
## 4                     1 high           left high
## 5                     5 high           center high
```

---

## `dplyr::case_when()`

A few things to note about `case_when()`:

- you can have multiple conditions per value
- conditions are evaluated consecutively
- when none of the specified conditions are met for an observation, by default, the new variable will have a missing value `NA` for that case
- if you want some other value in the new variable when the specified conditions are not met, you need to add `TRUE ~ value` as the last argument of the `case_when()` call
- to explore the full range of options for `case_when()` check out its [online documentation](https://dplyr.tidyverse.org/reference/case_when.html) or run `?case_when()` in `R`/*RStudio*

---

## `dplyr::case_when()`

<img src="data:image/png;base64,#C:\Users\mueller2\talks_presentations\r-intro-gesis-2021\content\img\dplyr_case_when.png" width="95%" style="display: block; margin: auto;" />
<small><small>Artwork by [Allison Horst](https://github.com/allisonhorst/stats-illustrations)</small></small>

---

## Get a bit more programmy

So far, all of the previous tasks share two characteristics:

- they are based on the structure of the whole dataset
- their output is again the whole dataset

Particularly in data analysis, our aim is often to extract information from a dataset (e.g., summary statistics, regression estimates).
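The difference can be sketched with a made-up tibble (`toy` and its column `value` are hypothetical): a transformation with `mutate()` returns the full dataset with an added column, whereas a summary function returns only the extracted information.

```r
library(dplyr)

# hypothetical toy data
toy <- tibble(value = c(1, 2, 6))

# transformation: the output has the same number of rows as the input
toy_centered <- 
  toy %>% 
  mutate(value_centered = value - mean(value))

# extraction: the output is a single number
overall_mean <- mean(toy$value)
```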